ABSTRACT

The min-max kernel is a generalization of the popular
resemblance kernel (which is designed for binary data). In
this paper, we demonstrate, through an extensive classification
study using kernel machines, that the min-max kernel
often provides an effective measure of similarity for nonnegative
data. As the min-max kernel is nonlinear and might be
difficult to use for industrial applications with massive
data, we show that the min-max kernel can be linearized via
hashing techniques. This allows practitioners to apply the
min-max kernel to large-scale applications using mature
linear algorithms such as linear SVM or logistic regression.

The previous remarkable work on consistent weighted sampling
(CWS) produces samples in the form of (i*, t*), where
i* records the location (and in fact also the weights)
information, analogous to the samples produced by classical
minwise hashing on binary data. Because t* is theoretically
unbounded, it was not immediately clear how to
effectively implement CWS for building large-scale linear
classifiers. In this paper, we provide a simple solution by
discarding t* (which we refer to as the “0-bit” scheme). Via
an extensive empirical study, we show that this 0-bit scheme
does not lose essential information. We then apply the “0-bit”
CWS for building linear classifiers to approximate min-max
kernel classifiers, as extensively validated on a wide
range of publicly available classification datasets.

We expect this work will generate interest among data mining
practitioners who would like to efficiently utilize the
nonlinear information of non-binary and nonnegative data.


1. INTRODUCTION 

Nonnegative data are common in practice, and the existence
of negative entries in a dataset is often due to shifting
or normalization. In this paper we show that the min-max
kernel can provide an effective measure of similarity for
nonnegative data and should be useful for building effective
large-scale data mining tools via hashing techniques.


Given two nonnegative data vectors $u, v \in \mathbb{R}^D$, we define the min-max kernel:

$$K_{MM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}} \qquad (1)$$


which is a generalization of the well-known resemblance:

$$K_{R}(u, v) = \frac{\sum_{i=1}^{D} 1\{u_i > 0 \text{ and } v_i > 0\}}{\sum_{i=1}^{D} 1\{u_i > 0 \text{ or } v_i > 0\}} \qquad (2)$$


The resemblance is a popular measure of similarity for binary
data. The prior work [22] used the term “resemblance
kernel” because the resemblance can be written
as the expectation of an inner product (and hence it is a
positive definite kernel). It will soon be clear that $K_{MM}$ in (1)
can also be written as the expectation of an inner product.
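The two similarity measures defined above can be sketched directly in code. The following is a minimal illustration in plain Python, not the implementation used in the paper; the function names and the convention of returning 0 for two all-zero vectors are assumptions made here.

```python
from typing import Sequence


def min_max_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of coordinate-wise minima divided by sum of coordinate-wise maxima."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    if any(x < 0 for x in u) or any(x < 0 for x in v):
        raise ValueError("the min-max kernel is defined for nonnegative data")
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    # Assumption: two all-zero vectors are given similarity 0.
    return num / den if den > 0 else 0.0


def resemblance(u: Sequence[float], v: Sequence[float]) -> float:
    """Resemblance: only the positions of the nonzeros matter, not the weights."""
    both = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return both / either if either > 0 else 0.0


u = [3.0, 0.0, 1.0, 2.0]
v = [1.0, 2.0, 0.0, 2.0]
print(min_max_kernel(u, v))  # (1 + 0 + 0 + 2) / (3 + 2 + 1 + 2) = 0.375
print(resemblance(u, v))     # 2 shared nonzeros out of 4 = 0.5
```

On binary (0/1) vectors the two functions return the same value, which is the sense in which the min-max kernel generalizes the resemblance.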

Readers (e.g., those from computer vision) have probably
realized that the min-max kernel defined in (1) is related to
the following so-called intersection kernel:

$$K_{I}(u, v) = \sum_{i=1}^{D} \min\{u_i, v_i\}, \qquad \sum_{i=1}^{D} u_i = 1, \quad \sum_{i=1}^{D} v_i = 1 \qquad (3)$$

In this paper, we will extensively compare the min-max kernel
with the intersection kernel in the context of kernel
machines for classification. Interestingly, for most datasets in
our experimental study, the min-max kernel outperforms the
intersection kernel, and in some cases significantly so. Of
course, another advantage of the min-max kernel is the
existence of hashing techniques [24, 14] to approximate this
nonlinear kernel by a linear kernel (at least conceptually).


The sum-to-one normalization in the definition of the intersection
kernel (3) appears natural, since the data vectors (e.g.,
u and v) were treated as histograms when the intersection
kernel was designed. Out of curiosity, we also define what
we call the “normalized min-max kernel” (n-min-max) as follows:

$$K_{NMM}(u, v) = \frac{\sum_{i=1}^{D} \min\{u_i, v_i\}}{\sum_{i=1}^{D} \max\{u_i, v_i\}}, \qquad \sum_{i=1}^{D} u_i = 1, \quad \sum_{i=1}^{D} v_i = 1 \qquad (4)$$


Our experiments will show that, for most datasets, this
normalization step affects the classification accuracies only
marginally, although there are exceptions. In this paper,
we often use “min-max kernels” to refer to both the
min-max kernel and the n-min-max kernel. Note that the
normalization step is conducted before applying hashing,
which means that these two kernels are no different as far
as the research on hashing is concerned.
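As a small illustration of the sum-to-one normalization, the sketch below computes the intersection kernel and the n-min-max kernel on the same pair of vectors. The helper names are hypothetical. Because min{a, b} + max{a, b} = a + b, the two sums add up to 2 after normalization, so on normalized data the n-min-max value equals K_I / (2 - K_I); the code checks this numerically.

```python
from typing import List, Sequence


def sum_to_one(u: Sequence[float]) -> List[float]:
    """Scale a nonnegative vector so that its entries sum to 1."""
    total = sum(u)
    if total <= 0:
        raise ValueError("vector must have a positive sum")
    return [x / total for x in u]


def intersection_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of coordinate-wise minima of the sum-to-one normalized vectors."""
    un, vn = sum_to_one(u), sum_to_one(v)
    return sum(min(a, b) for a, b in zip(un, vn))


def n_min_max_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Min-max kernel applied after sum-to-one normalization."""
    un, vn = sum_to_one(u), sum_to_one(v)
    num = sum(min(a, b) for a, b in zip(un, vn))
    den = sum(max(a, b) for a, b in zip(un, vn))
    return num / den


u = [3.0, 0.0, 1.0, 2.0]
v = [1.0, 2.0, 0.0, 2.0]
k_i = intersection_kernel(u, v)
k_nmm = n_min_max_kernel(u, v)
print(k_i, k_nmm, k_i / (2.0 - k_i))
```

The relation is monotone, so the two kernels rank pairs of normalized vectors identically even though their values differ.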


It is worth mentioning that the above three kernels (min-max,
intersection, and n-min-max) have no tuning parameters.
Thus, it is often possible to further improve the
performance by, for example, using multiple kernels or kernels
combined in a special fashion (e.g., the CoRE kernels,
obtained by multiplying resemblance with correlation).





We will compare these three types of parameter-free kernels
with the basic (tuning-free) linear kernel:

$$K_{\rho}(u, v) = \sum_{i=1}^{D} u_i v_i, \qquad \sum_{i=1}^{D} u_i^2 = \sum_{i=1}^{D} v_i^2 = 1 \qquad (5)$$

For convenience, we enforce the normalization (to unit length)
because in practice (e.g., when running linear SVM) the
normalization step is typically recommended.

The min-max kernel was sparsely discussed in the literature
[24, 14]. In contrast, the resemblance kernel (2) has
been widely used in practice on binary (or binarized) data.
For example, [22] demonstrated the use of b-bit minwise
hashing for training large-scale (resemblance kernel) SVM
and logistic regression.

Summary of our contributions: This paper aims at
addressing several interesting and important issues regarding
the use of min-max kernels for data mining applications:

1. Why use min-max kernels? Table 1 and Figures 1
to 3 provide an extensive empirical study of kernel
SVMs for classification on a sizable collection of public
datasets, comparing the linear kernel, the min-max kernel,
the n-min-max kernel, and the intersection kernel. The results
illustrate the advantages of the min-max kernels over
the linear kernel as well as the intersection kernel.

2. The “0-bit” CWS hashing for min-max kernels.
The remarkable prior work on consistent weighted sampling
(CWS) provides a recipe to sample min-max kernels
(i.e., the collision probability of the samples is
the min-max kernel), in the form of (i*, t*). Because
t* is theoretically unbounded, it was not immediately
clear how to effectively implement a “b-bit” version of
CWS, which is needed in order to apply the method to
large-scale industrial applications. We provide a (surprisingly)
simple solution by completely discarding t*
(after hashing), which we refer to as the “0-bit” scheme
and which is validated by a large set of experiments.

3. Large-scale learning with (modified) CWS hashing. In
light of our contributions 1 and 2, we apply the proposed
0-bit CWS hashing for efficiently building large-scale
linear classifiers approximately in the space of
min-max kernels, as verified by extensive experiments.

2. KERNEL SVM EXPERIMENTS 

In this section, we present an experimental study for
classification using kernel machines based on the four types of
kernels we have introduced: the linear kernel, the min-max
kernel, the n-min-max kernel, and the intersection kernel. To
simplify the experimental procedure, we use the LIBSVM
pre-computed kernel functionality and $l_2$-regularization. Table 1
summarizes the test classification accuracies.
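To make the pre-computed kernel workflow concrete, the sketch below writes a tiny min-max Gram matrix in the text format that LIBSVM expects for pre-computed kernels, where each line starts with the label, then `0:<serial number>`, then one `j:K(x_i, x_j)` entry per training example. The toy data, the helper names and the 6-digit formatting are assumptions for illustration; this is not the script used for the experiments.

```python
from typing import Iterator, List, Sequence


def min_max_kernel(u: Sequence[float], v: Sequence[float]) -> float:
    """Min-max kernel for nonnegative vectors (0 if both are all-zero)."""
    den = sum(max(a, b) for a, b in zip(u, v))
    return sum(min(a, b) for a, b in zip(u, v)) / den if den > 0 else 0.0


def precomputed_rows(labels: Sequence[int],
                     data: Sequence[Sequence[float]]) -> Iterator[str]:
    """Yield one LIBSVM pre-computed-kernel line per training example."""
    for i, (label, x_i) in enumerate(zip(labels, data), start=1):
        entries = " ".join(
            f"{j}:{min_max_kernel(x_i, x_j):.6g}"
            for j, x_j in enumerate(data, start=1)
        )
        # Column 0 holds the 1-based serial number of the example.
        yield f"{label} 0:{i} {entries}"


# Hypothetical toy dataset with three nonnegative examples.
data: List[List[float]] = [[1, 2, 0], [0, 2, 2], [3, 0, 1]]
labels = [1, -1, 1]
lines = list(precomputed_rows(labels, data))
for line in lines:
    print(line)
```

The resulting file would then be trained with the pre-computed kernel type (`-t 4` in the LIBSVM command-line tools). The full n x n matrix is materialized, which is what limits this approach to moderate n.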

While these kernels do not have tuning parameters, there
is a regularization parameter C for the $l_2$-regularized SVM. To
ensure repeatability, we report the test classification
accuracies for a wide range of C values from $10^{-2}$ to $10^{3}$ with
a fine grid, in Figures 1 to 3. The accuracies reported in
Table 1 are the (individually) highest points on the curves.

The results in Table 1 and Figures 1 to 3 confirm that
using min-max kernels typically results in better classification
performance compared to the linear kernel as well as the intersection
kernel. This experimental study, to an extent, helps justify
the use of min-max kernels in learning applications.

Figure 1: Test classification accuracies for four types
of kernels using $l_2$-regularized SVM (with a tuning
parameter C, i.e., the x-axis). Each panel presents
the results for one particular dataset (see more data
information in Table 1). The two solid curves represent
the min-max kernel (red, if color is available)
and the n-min-max kernel (green, if color is available).
The dashed curve (blue) and the dot-dashed
(black) curve represent, respectively, the linear kernel
and the intersection kernel. See Figures 2 and 3
for the results on more datasets.

Figure 2: Test classification accuracies for four types
of kernels using $l_2$-regularized SVM.

Figure 3: Test classification accuracies for four types
of kernels using $l_2$-regularized SVM.

















































































































































































































































































Table 1: Classification accuracies (in %) for using 4 different kernels. We use LIBSVM “pre-computed” kernels
and $l_2$-regularized kernel SVM (which has a tuning parameter C). The reported test classification accuracies
(i.e., the rightmost 4 columns) are the best accuracies from a wide range of C values; see Figures 1 to 3 for
more details. The datasets are all public (and mostly well-known), from various sources including the UCI
repository, the LIBSVM web site, a book web site, and the two papers [17, 18] which compared deep
nets, boosting and trees, kernel SVMs, etc. (Also see http://hunch.net/?p=1467 for interesting discussions.)

Whenever possible, we use the conventional partitions of training and testing sets. We have made efforts
to ensure the repeatability of our experiments by using pre-computed kernels and reporting the results for
a very wide range of C values. However, this strategy also limits the scale of our experiments because
most workstations do not have sufficient memory to store the kernel matrix for datasets of even moderate
sizes (for example, a mere 60,000 x 60,000 kernel matrix has $3.6 \times 10^{9}$ entries). Therefore, for the sake of
repeatability, for a few datasets we only use a subset of the samples. Please feel free to contact the author if
more information is needed in order to reproduce the experiments. Several special notes about the datasets:

(i) Whenever possible, we always use the data “as they are” from the sources. Although we agree it is a very
important research task to study how to transform the data to favor certain types of similarities, it is not the
focus of our paper (and may hurt the repeatability of the experiments if we try to alter the data). (ii) Several
datasets downloaded from the LIBSVM site were already scaled to [-1, 1]. To make use of these datasets, we
simply transform them by (z + 1)/2, where z is the original feature value.


Dataset                  # train samples  # test samples  linear  min-max  n-min-max  intersection
Covertype10k                      10,000          50,000    70.9     80.4       80.2          74.3
Covertype20k                      20,000          50,000    71.1     83.3       83.1          75.2
IJCNN5k                            5,000          91,701    91.6     94.4       95.3          94.0
IJCNN10k                          10,000          91,701    91.6     95.7       96.0          94.5
Isolet                             6,238           1,559    95.4     96.4       96.6          96.4
Letter                            16,000           4,000    62.4     96.2       95.0          92.1
Letter4k                           4,000          16,000    61.2     91.4       90.2          87.9
M-Basic                           12,000          50,000    90.0     96.2       96.0          93.4
M-Image                           12,000          50,000    70.7     80.8       77.0          76.2
MNIST10k                          10,000          60,000    90.0     95.7       95.4          93.1
M-Noise1                          10,000           4,000    60.3     71.4       68.5          68.2
M-Noise2                          10,000           4,000    62.1     72.4       70.7          70.0
M-Noise3                          10,000           4,000    65.2     73.6       71.9          71.6
M-Noise4                          10,000           4,000    68.4     76.1       75.2          74.8
M-Noise5                          10,000           4,000    72.3     79.0       78.4          77.9
M-Noise6                          10,000           4,000    78.7     84.2       84.3          83.9
M-Rand                            12,000          50,000    78.9     84.2       84.1          83.7
M-Rotate                          12,000          50,000    48.0     84.8       83.9          60.8
M-RotImg                          12,000          50,000    31.4     41.0       38.5          37.0
Optdigits                          3,823           1,797    95.3     97.7       97.4          96.8
Pendigits                          7,494           3,498    87.6     97.9       98.0          97.5
Phoneme                            3,340           1,169    91.4     92.5       92.0          91.6
Protein                           17,766           6,621    69.1     72.4       70.7          69.6
RCV1                              20,242          60,000    96.3     96.9       96.9          96.7
Satimage                           4,435           2,000    78.5     90.5       87.8          86.9
Segment                            1,155           1,155    92.6     98.1       97.5          97.0
SensIT20k                         20,000          19,705    80.5     86.9       87.0          85.5
Shuttle1k                          1,000          14,500    90.9     99.7       99.6          99.6
Spam                               3,065           1,536    92.6     95.0       94.7          94.2
Splice                             1,000           2,175    85.1     95.2       94.9          93.8
USPS                               7,291           2,007    91.7     95.3       95.3          94.8
Vowel                                528             462    40.9     59.1       53.5          49.8
WebspamN1-20k (1-gram)            20,000          60,000    93.0     97.9       97.8          96.6
YoutubeVision                     11,736          10,000    63.3     72.4       72.4          70.8







The purpose of this experimental study on kernel SVMs
is not to show that min-max kernels achieve the best
classification accuracies. In fact, compared to trees or deep
nets [17, 18], simply using min-max kernels usually does not
achieve the best accuracies, although the results are close.
Since min-max kernels have no tuning parameters, we can
expect to boost the performance by using additional parameters
or by combining multiple kernels of the same (or different)
types. For example, using the idea from CoRE kernels,
we can multiply the min-max kernel with chi-square
kernels (which can be hashed by sign Cauchy projections [21]).

For large-scale industrial applications, it is typically
difficult to directly use (nonlinear) kernels. Fortunately, with
CWS (consistent weighted sampling), we can linearize the
min-max kernel. In other words, it is possible to achieve the
good performance of min-max kernels at the cost of linear
kernels. In this paper, we will show how to do CWS better.

3. HASHING MIN-MAX KERNEL 

The classification experiments reported in Table 1 and
Figures 1 to 3 have demonstrated the effectiveness of min-max
kernels in terms of prediction accuracies. However, in
order to make min-max kernels practical for large-scale data
mining tasks, we need to resort to hashing techniques to
(approximately) transform nonlinear kernels into linear kernels.

It is well understood [3] that computing kernels is expensive
and that the kernel matrix, if fully materialized, does not fit
in memory even for relatively small applications. In contrast,
highly efficient linear algorithms
have been widely used in practice for truly large-scale
applications such as click predictions in online advertising [25].
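A back-of-the-envelope calculation shows the scale of the problem. The sketch below compares the storage of a dense n x n kernel matrix with that of a linear representation that keeps k small integers per example. The 8-byte entries, k = 4096 and one byte per stored integer are illustrative assumptions, not figures from the experiments.

```python
def kernel_matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Storage for a fully materialized n x n kernel matrix."""
    return n * n * bytes_per_entry


def hashed_data_bytes(n: int, k: int, bytes_per_value: int = 1) -> int:
    """Storage for n examples with k stored integers each."""
    return n * k * bytes_per_value


n = 60_000
dense = kernel_matrix_bytes(n)          # 3.6e9 entries at 8 bytes each
hashed = hashed_data_bytes(n, k=4096)   # assumed k and 1 byte per value
print(f"kernel matrix: {dense / 1e9:.1f} GB")
print(f"hashed data:   {hashed / 1e9:.2f} GB")
```

The dense cost grows quadratically in n, while the linear representation grows only linearly, which is why linear methods remain usable at scales where the kernel matrix cannot be stored.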

3.1 Consistent Weighted Sampling (CWS) 

The prior efforts [24, 14] have led to the method called
“consistent weighted sampling (CWS)” for hashing min-max
kernels. Here, we adopt the beautiful description of CWS
in [14], as shown in Alg. 1.


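Algorithm 1 Consistent Weighted Sampling (CWS)

Input: Data vector $u = (u_i > 0,\ i = 1 \text{ to } D)$

Output: Consistent uniform sample $(i^*, t^*)$

For $i$ from 1 to $D$
    $r_i \sim \mathrm{Gamma}(2, 1)$, $c_i \sim \mathrm{Gamma}(2, 1)$, $\beta_i \sim \mathrm{Uniform}(0, 1)$
    $t_i \leftarrow \left\lfloor \frac{\log u_i}{r_i} + \beta_i \right\rfloor$, $y_i \leftarrow \exp(r_i (t_i - \beta_i))$, $a_i \leftarrow c_i / (y_i \exp(r_i))$
End For

$i^* \leftarrow \arg\min_i a_i$, $t^* \leftarrow t_{i^*}$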
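The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the author's code. The function names are hypothetical, indices are 0-based, and zero entries are skipped, which is an assumption made here because log u_i is undefined at 0. The random triples are drawn separately from the sampling step so that the same triples can be reused for every data vector.

```python
import math
import random
from typing import List, Sequence, Tuple

# One (r_i, c_i, beta_i) triple per dimension.
Randoms = List[Tuple[float, float, float]]


def draw_randoms(dim: int, rng: random.Random) -> Randoms:
    """Draw r_i, c_i ~ Gamma(2, 1) and beta_i ~ Uniform(0, 1) for each dimension."""
    return [
        (rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0), rng.random())
        for _ in range(dim)
    ]


def cws_sample(u: Sequence[float], randoms: Randoms) -> Tuple[int, int]:
    """Return one sample (i*, t*) for u, using the shared random triples."""
    best_a, i_star, t_star = math.inf, -1, 0
    for i, (u_i, (r, c, beta)) in enumerate(zip(u, randoms)):
        if u_i <= 0:
            continue  # assumption: zero entries can never be selected
        t = math.floor(math.log(u_i) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if a < best_a:
            best_a, i_star, t_star = a, i, t
    if i_star < 0:
        raise ValueError("u must have at least one positive entry")
    return i_star, t_star


rng = random.Random(0)
u = [3.0, 0.0, 1.0, 2.0]
randoms = draw_randoms(len(u), rng)
print(cws_sample(u, randoms))
```

Calling `cws_sample` twice with the same `randoms` returns the same pair, which is the consistency that makes samples from different vectors comparable.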

Given a data vector $u \in \mathbb{R}^D$, Alg. 1 provides the procedure
for generating one CWS sample (i*, t*). In order to generate
k such samples, we have to repeat the procedure k times
using an independent set of random numbers $r_i$, $c_i$, $\beta_i$. For
clarity, we denote the samples for data vectors u and v as

$$(i^*_{u,j}, t^*_{u,j}) \text{ and } (i^*_{v,j}, t^*_{v,j}), \qquad j = 1, 2, \ldots, k \qquad (6)$$

Basically we need to generate 3 matrices: $\{r\}$, $\{c\}$, and
$\{\beta\}$, of size $D \times k$. All the data vectors will use the same
3 matrices. This is essentially the same cost as random
projections (which, however, approximate linear kernels).

The basic theoretical result of CWS says the “collision
probability” is exactly $K_{MM}$:

$$\Pr\left\{(i^*_{u,j}, t^*_{u,j}) = (i^*_{v,j}, t^*_{v,j})\right\} = K_{MM}(u, v) \qquad (7)$$


Thus, it is clear that, at least conceptually, we can express
$K_{MM}(u, v)$ as the expectation of an inner product, and hence
$K_{MM}$ is positive definite, just as [22] showed that the
resemblance is a type of positive definite kernel.
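The collision probability result suggests a simple estimator: draw k independent samples for both vectors with shared random numbers and report the fraction of exact (i*, t*) matches. The sketch below illustrates this under the same assumptions as a plain-Python CWS routine (hypothetical names, 0-based indices, zero entries skipped, at least one positive entry per vector). It is an illustration, not the paper's implementation.

```python
import math
import random


def draw_randoms(dim, rng):
    """r_i, c_i ~ Gamma(2, 1), beta_i ~ Uniform(0, 1), shared by all vectors."""
    return [(rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0), rng.random())
            for _ in range(dim)]


def cws_sample(u, randoms):
    """One CWS sample (i*, t*); assumes u has at least one positive entry."""
    best = None
    for i, (u_i, (r, c, beta)) in enumerate(zip(u, randoms)):
        if u_i <= 0:
            continue
        t = math.floor(math.log(u_i) / r + beta)
        a = c / (math.exp(r * (t - beta)) * math.exp(r))
        if best is None or a < best[0]:
            best = (a, i, t)
    return best[1], best[2]


def estimate_min_max(u, v, k, seed=0):
    """Fraction of the k samples on which (i*, t*) agrees for u and v."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(k):
        randoms = draw_randoms(len(u), rng)  # fresh, independent randomness per sample
        hits += cws_sample(u, randoms) == cws_sample(v, randoms)
    return hits / k


u = [3.0, 0.0, 1.0, 2.0]
v = [1.0, 2.0, 0.0, 2.0]
exact = sum(map(min, u, v)) / sum(map(max, u, v))  # 3 / 8
print(exact, estimate_min_max(u, v, k=4000))
```

Each match indicator is the inner product of two one-hot encodings of the samples, so the average over k samples is an inner product of two sparse vectors, which is the sense in which the kernel becomes linear.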

3.2 Drawback of CWS for Data Mining 

Although the basic probability result (7) says that, conceptually,
we can use CWS for building linear classifiers (approximately
in the space of min-max kernels), it is not immediately
clear how it can be implemented efficiently.

The prior work briefly mentioned that one can “uniformly map” the
sample space (i*, t*) to a space of b bits: $\{0, 1, 2, \ldots, 2^b - 1\}$.
This, however, cannot be (easily) achieved. While i* is
bounded by D, t* is actually unbounded (see Alg. 1). Also
note that the space of samples is very large. If we represent i*
by $b_i$ bits and t* (approximately) by $b_t$ bits, the space will
be of size $2^{b_i + b_t}$. Thus, we must find an efficient representation of
CWS samples in order to use this nice method effectively for
machine learning and data mining applications.

3.3 Our “0-bit” Proposal for CWS 

It is now known how to use b-bit minwise hashing to
approximate the resemblance kernel and use it for large-scale
applications [20, 22]. Therefore, in this paper, we focus on
representing t*. Perhaps surprisingly, our proposal is simple:
just ignore t* in the sample (i*, t*), i.e., the “0-bit” scheme.

If we examine Alg. 1, we can see that i* has already
encoded the information about the weights of the data. A
rigorous proof, however, turns out to be a difficult probability
problem, which is outside the scope of this paper. Here,
we try to empirically demonstrate the following observation:

$$\Pr\left\{i^*_{u,j} = i^*_{v,j}\right\} \approx \Pr\left\{(i^*_{u,j}, t^*_{u,j}) = (i^*_{v,j}, t^*_{v,j})\right\} \qquad (8)$$

We call our proposal the “0-bit” scheme only to mean that
we use 0 bits for coding t*. We also call the original proposal
the “full” scheme since it stores all the bits needed for t*.
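The difference between the two schemes is only in what is compared. The sketch below estimates the kernel once by matching (i*, t*) (the full scheme) and once by matching i* alone (the 0-bit scheme), on a pair of synthetic 50-dimensional vectors generated here purely for illustration. All names and the data generator are assumptions. Every full match is also an i* match, so on any run the 0-bit estimate is at least the full estimate. How close the two are is an empirical question, and very low-dimensional toy vectors may show a larger gap than the real data studied in the paper.

```python
import math
import random


def draw_randoms(dim, rng):
    """Shared r_i, c_i ~ Gamma(2, 1) and beta_i ~ Uniform(0, 1)."""
    return [(rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0), rng.random())
            for _ in range(dim)]


def cws_sample(u, randoms):
    """One CWS sample (i*, t*); zero entries are skipped (assumption)."""
    best = None
    for i, (u_i, (r, c, beta)) in enumerate(zip(u, randoms)):
        if u_i <= 0:
            continue
        t = math.floor(math.log(u_i) / r + beta)
        a = c / (math.exp(r * (t - beta)) * math.exp(r))
        if best is None or a < best[0]:
            best = (a, i, t)
    return best[1], best[2]


def estimate_both(u, v, k, seed=0):
    """Return (full-scheme estimate, 0-bit estimate) from the same k samples."""
    rng = random.Random(seed)
    full_hits = zero_bit_hits = 0
    for _ in range(k):
        randoms = draw_randoms(len(u), rng)
        (iu, tu), (iv, tv) = cws_sample(u, randoms), cws_sample(v, randoms)
        zero_bit_hits += iu == iv              # t* discarded
        full_hits += (iu, tu) == (iv, tv)      # all bits of t* kept
    return full_hits / k, zero_bit_hits / k


# Synthetic sparse nonnegative vectors (illustrative only).
gen = random.Random(7)
u = [gen.random() * 5 if gen.random() < 0.5 else 0.0 for _ in range(50)]
v = [max(0.0, x + gen.uniform(-1, 1)) for x in u]
exact = sum(map(min, u, v)) / sum(map(max, u, v))
full, zero_bit = estimate_both(u, v, k=2000)
print(exact, full, zero_bit)
```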

3.4 An Experimental Study on “0-bit” CWS 

Table 2: Information on the 13 pairs of English words.
For example, “HONG” refers to the vector of occurrences
of the word “HONG” in $2^{16}$ documents. $f_1$ and $f_2$ are the
numbers of nonzeros in word 1 and word 2, respectively.
For each pair, we include the numerical values for both
the resemblance (“R”) and the min-max kernel (“MM”).


Word 1     Word 2        f1      f2       R        MM
A          THE         39063   42754   0.6444   0.3543
ADDICT     PRICELESS      77      77   0.0065   0.0052
AIR        DOCTOR       3159     860   0.0439   0.0248
CREDIT     CARD         2999    2697   0.2849   0.2091
GAMBIA     KIRIBATI      206     186   0.7118   0.6070
HONG       KONG          940     948   0.9246   0.8985
OF         AND         37339   36289   0.7711   0.6084
PAPER      REVIEW       1944    3197   0.0780   0.0502
PIPELINE   FLUSH         139     118   0.0158   0.0143
SAN        FRANCISCO    3194    1651   0.4758   0.2885
THIS       TODAY       27695    5775   0.1518   0.0658
TIME       JOB         37339   36289   0.1279   0.0794
UNITED     STATES       4079    3981   0.5913   0.5017


Table 2 lists 13 pairs of English words. Each word represents
a vector of occurrences of that word in a total of $2^{16}$
documents. This is a typical example of heavy-tailed data
in that the weights vary dramatically. In common machine
learning applications, the weights often do not vary as much
(at least at the point when we are prepared to compute
distances/similarities from data). In that sense, we are actually
testing our “0-bit” proposal in a more challenging setting.

We have experimented with many more pairs of words
than these 13 pairs, but the results look essentially the same,
i.e., there is no practical difference between the 0-bit scheme and the
full scheme, as shown in Figures 4 and 5.

In the experiment, we let k vary from 1 to 1000 and estimate
$K_{MM}$ from k measurements $(i^*_j, t^*_j)$, j = 1 to k. With
the full scheme, we keep all the bits of t*. With the 0-bit
scheme, we completely discard t*. For each k, we repeat the
simulations 10,000 times to reliably compute the empirical
mean square error (MSE) and the bias for each pair.

The right columns of Figures 4 and 5 plot the empirical
MSEs, together with the theoretical variance
$K_{MM}(1 - K_{MM})/k$ (i.e., the variance of the binomial). Because the curves
for the 0-bit scheme and the full scheme overlap the theoretical
variances, we can conclude, at least for these data, that
our proposed 0-bit scheme is essentially unbiased and that its
variance matches the theoretical variance of the full scheme.
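The comparison between empirical MSE and the binomial variance can be reproduced in miniature. The sketch below is a scaled-down illustration under stated assumptions: a toy pair of 4-dimensional vectors, k = 50 and 400 repetitions instead of the 10,000 repetitions and word-occurrence vectors used in the study. It uses the full scheme, for which the variance formula K(1 - K)/k holds exactly. The function names are hypothetical.

```python
import math
import random


def draw_randoms(dim, rng):
    """Shared r_i, c_i ~ Gamma(2, 1) and beta_i ~ Uniform(0, 1)."""
    return [(rng.gammavariate(2.0, 1.0), rng.gammavariate(2.0, 1.0), rng.random())
            for _ in range(dim)]


def cws_sample(u, randoms):
    """One CWS sample (i*, t*); zero entries are skipped (assumption)."""
    best = None
    for i, (u_i, (r, c, beta)) in enumerate(zip(u, randoms)):
        if u_i <= 0:
            continue
        t = math.floor(math.log(u_i) / r + beta)
        a = c / (math.exp(r * (t - beta)) * math.exp(r))
        if best is None or a < best[0]:
            best = (a, i, t)
    return best[1], best[2]


def empirical_mse_and_bias(u, v, k, repetitions, seed=0):
    """MSE and bias of the full-scheme estimator over independent repetitions."""
    rng = random.Random(seed)
    exact = sum(map(min, u, v)) / sum(map(max, u, v))
    errors = []
    for _ in range(repetitions):
        hits = 0
        for _ in range(k):
            randoms = draw_randoms(len(u), rng)
            hits += cws_sample(u, randoms) == cws_sample(v, randoms)
        errors.append(hits / k - exact)
    mse = sum(e * e for e in errors) / repetitions
    bias = sum(errors) / repetitions
    return mse, bias


u = [3.0, 0.0, 1.0, 2.0]
v = [1.0, 2.0, 0.0, 2.0]
k = 50
exact = 0.375
theoretical_variance = exact * (1.0 - exact) / k
mse, bias = empirical_mse_and_bias(u, v, k, repetitions=400)
print(theoretical_variance, mse, bias)
```

With only 400 repetitions the empirical MSE fluctuates around the theoretical value by several percent; a larger number of repetitions would tighten the agreement.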

To avoid many “boring” figures, we let k be as small as 1
(while typical simulations would use a much larger number,
such as 10, to start with). Nevertheless, these MSE curves
are still quite boring since all the curves essentially overlap.

To make the presentation somewhat more interesting, we
also present the empirical biases in the left columns of the
two figures. Now we can see some discrepancies between
the two schemes, typically on the order of $< 10^{-4}$ (in the
stabilized zone, i.e., when k is not too small). While such
small biases (at the 4th or 5th decimal place) would not
make any practical difference, they do serve to
remind us that the 0-bit scheme is indeed an approximation.

To make the plots even more interesting, we add the curves 
for the “1-bit” scheme (i.e., by recording whether t* is even or 
odd). For “CREDIT-CARD”, “PIPELINE-FLUSH”, “SAN- 
FRANCISCO”, and “THIS-TODAY”, we can observe (very 
small) differences between the 0-bit scheme and the full- 
scheme. The differences vanish once we use the “1-bit” scheme. 
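The 0-bit, 1-bit and full schemes differ only in how much of t* is kept before two samples are compared. The sketch below makes this explicit on a handful of hand-written (i*, t*) pairs, which are hypothetical values rather than output of the experiments. Keeping the lowest bits of t* is an assumption for schemes with more than one bit; for the 1-bit scheme it coincides with recording whether t* is even or odd.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[int, int]  # (i*, t*)


def code_sample(sample: Sample, t_bits: Optional[int]) -> Tuple[int, int]:
    """Keep i* and only the lowest t_bits bits of t* (None keeps t* in full)."""
    i_star, t_star = sample
    if t_bits is None:
        return i_star, t_star
    # t_bits = 0 maps every t* to 0; Python's % is nonnegative for negative t*.
    return i_star, t_star % (1 << t_bits)


def collision_rate(samples_u: Sequence[Sample], samples_v: Sequence[Sample],
                   t_bits: Optional[int]) -> float:
    """Fraction of positions where the coded samples of u and v agree."""
    hits = sum(code_sample(a, t_bits) == code_sample(b, t_bits)
               for a, b in zip(samples_u, samples_v))
    return hits / len(samples_u)


# Hypothetical samples for two vectors (k = 4).
samples_u = [(3, 2), (1, -1), (4, 0), (2, 5)]
samples_v = [(3, 2), (1, 1), (4, 1), (7, 5)]
print(collision_rate(samples_u, samples_v, None))  # full scheme: 0.25
print(collision_rate(samples_u, samples_v, 1))     # 1-bit scheme: 0.5
print(collision_rate(samples_u, samples_v, 0))     # 0-bit scheme: 0.75
```

Keeping fewer bits can only merge codes, so the collision rate is non-decreasing as bits of t* are dropped. The empirical claim of the section is that on real data the increase is negligible.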

From Table 2, we can see that binarizing the data usually
leads to very different similarities (i.e., the last two columns,
R and MM, differ significantly). The 0-bit scheme,
which only uses i*, still approximates the original
min-max kernel very well, rather than the resemblance kernel. This
confirms that, even though our samples (i.e., i*) are in the same
format as samples from minwise hashing (for example, both
are integers bounded by D), they are statistically very different
samples. In other words, our 0-bit scheme is not the
same as simply doing the original minwise hashing.

Finally, to entertain readers, we add Figure 6 to report
the bias results obtained by keeping all the bits of t* and only a few
(0, 1, 2, 4) bits of i*. Clearly, using only t*, or t* with a few bits
of i*, will not lead to a good estimate of the min-max kernel.

4. KERNEL SVM WITH MODIFIED CWS 

We conduct a set of experiments using “0-bit” CWS for
approximately training min-max kernel SVMs. Basically, for
each dataset, we apply CWS hashing for k up to 4096 and,
after hashing, we discard t* and only keep a matrix of {i*}, which
has the same number of rows as the number of examples
in the dataset and k columns. We then use the popular
LIBLINEAR package for training a linear SVM on the data
generated by {i*}, following the scheme proposed by [22].







Figure 4: Results for estimating min-max kernels using
the “full” scheme by recording all the bits of (i*, t*) and
the “0-bit” scheme by discarding t*. For each word pair
and k, we conducted simulations 10,000 times to compute
the mean square errors (MSE) and the biases. The
empirical MSE curves (right column) show that both the
0-bit and the full scheme match the theoretical variance.
The empirical biases (left column) present a magnified
view of errors. For a few pairs (also see Figure 5), the
estimates by the 0-bit scheme have noticeable ($< 10^{-4}$)
biases. By using the “1-bit” scheme (i.e., by recording
whether t* is even or odd), these biases vanish.

Figure 5: Simulations for estimating min-max kernels.
See the caption of Figure 4 for more details.

Figure 6: The biases by using full information of t*
and only a few (0, 1, 2, or 4) bits of i*.


There is one important detail. In practice, since the space
D is typically large, we often need to choose to store only a
few (say $b_i$) bits of i*. In other words, after we obtain the sample
(i*, t*), we will use $b_i$ bits for storing i* and 0 bits for storing
t*. The effective data matrix will have $2^{b_i} \times k$ dimensions,
with exactly k 1's in each row. In our experimental study,
we always use four choices of $b_i \in \{1, 2, 4, 8\}$, corresponding
to the four columns (from left to right) in Figures 7 and 8.
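The expansion described above can be sketched as follows: each of the k stored values selects exactly one position inside its own block of 2^{b_i} coordinates. The function names are hypothetical, and keeping the lowest b_i bits of i* is an assumption, since the text does not say which bits are stored. The second helper writes the row in the sparse `index:value` text format (1-based, ascending indices) read by LIBSVM and LIBLINEAR.

```python
from typing import List, Sequence


def expand_zero_bit(i_stars: Sequence[int], b_i: int) -> List[int]:
    """Map k values of i* to the k active (0-based) coordinates of a
    binary vector with 2**b_i * k dimensions and exactly k ones."""
    width = 1 << b_i
    # Sample j owns coordinates [j * width, (j + 1) * width).
    return [j * width + (i_star % width) for j, i_star in enumerate(i_stars)]


def to_sparse_line(label: int, active: Sequence[int]) -> str:
    """Format one example as 'label idx:1 idx:1 ...' with 1-based indices."""
    return f"{label} " + " ".join(f"{idx + 1}:1" for idx in active)


i_stars = [5, 0, 13]                    # hypothetical 0-bit CWS output, k = 3
active = expand_zero_bit(i_stars, b_i=2)
print(active)                           # [1, 4, 9]
print(to_sparse_line(1, active))        # 1 2:1 5:1 10:1
```

The inner product of two such expanded vectors counts the positions j at which the stored bits of i* agree, so a linear SVM trained on them approximates a min-max kernel SVM, with additional accidental matches when b_i is small.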

Figure 7 presents the linear SVM experiments on a variety
of datasets. In each panel, the two dashed curves
(red/top and blue/bottom) correspond to the original test
accuracies for the min-max kernel and the linear kernel
(respectively). In each panel, the solid curves are the results
for feeding the 0-bit CWS hashed data to LIBLINEAR, for
k = 32, 64, 128, 256, 512, 1024, 2048, 4096 (from bottom to
top). For most of the datasets, we can see that the test
accuracies approach the results of min-max kernels when k is
large enough, especially if we use 8 bits to store each i*.

Figure 8 presents an interesting study comparing the
0-bit scheme (i.e., $b_t = 0$ for t*) with the 2-bit scheme (i.e.,
$b_t = 2$ for t*). We can see that once we use > 4 bits for i*, it
makes no essential difference whether we use the 0-bit or the 2-bit
scheme for t*, i.e., the solid and dashed curves overlap.


5. DISCUSSION AND CONCLUSION 

We can view CWS as a tool for “feature engineering” in
that it allows practitioners to generate data so that the inner
products of the transformed data approximate the min-max
kernel values of the original data. We can then utilize
extremely efficient and scalable (batch or online) linear methods
to equivalently train a nonlinear SVM. In other words,
we pay the price of linear learning for nonlinear learning.

For certain applications, linear models based on the original
data might be good enough. In that case, if there is a
need for dimension reduction, we can use well-known random
projection methods. For many datasets (e.g., Table 1),
however, linear models are not sufficient and we often have
to resort to nonlinear models and computationally intensive
procedures. Interestingly, min-max kernels are suitable for
many nonnegative datasets, and hence developing efficient
ways of approximating min-max kernels becomes useful.

Our contributions consist of three parts. Firstly, we conduct
an extensive empirical study of training nonlinear kernel
SVMs using min-max kernels, on a wide variety of public
datasets. This study answers why we should consider using
min-max kernels instead of linear kernels. Secondly, we propose
an efficient (and surprisingly simple) implementation
of consistent weighted sampling, called “0-bit” CWS, and we
validate this proposal via an extensive simulation study using
real text data. Finally, we show that the 0-bit CWS can
be easily integrated into a linear learning system and we
demonstrate, on a variety of datasets, that we can achieve
the results of nonlinear SVMs by training linear SVMs.










































































































































[Figure panels (plots omitted): Satimage, Shuttle1k, Splice, USPS, WebspamN1-20k; x-axis: C, from 10^-2 to 10^3; y-axis: test accuracy (%).]

Figure 7: Classification accuracies by using 0-bit CWS hashing and linear SVM. The original CWS algorithm produces
samples in the form of (i*, t*). The 0-bit scheme discards t*. From left to right, the four columns represent the results
for coding i* using 1 bit, 2 bits, 4 bits, and 8 bits, respectively. In each panel, the two dashed curves represent the
original classification results using the min-max kernel (top and red) and the linear kernel (bottom and blue). The solid curves
are the results of linear SVM and 0-bit CWS with k = 32, 64, 128, 256, 512, 1024, 2048, 4096 (from bottom to top).

Figure 8: Classification accuracies by using linear SVM with 0-bit CWS (solid and black curves) and 2-bit CWS
(dashed and red curves). The original CWS algorithm produces samples in the form of (i*, t*). The 0-bit scheme
discards t* while the 2-bit scheme keeps 2 bits for each t*. From left to right, the four columns represent the results for
coding i* using 1 bit, 2 bits, 4 bits, and 8 bits, respectively. In each panel, the 3 solid curves (0-bit scheme for k = 128,
512, 2048) and the 3 dashed curves (2-bit scheme) essentially overlap, especially when we use > 4 bits for coding i*.